Final project for the Introduction to Data Science / Text as Data class

By Adelaida Barrera (), Natalia Mejía (), Mariana Saldarriaga () and Isabel de Brigard ()

What we set out to do

This semester we seem to be uninterruptedly glued to our screens. From Zoom to R, more and more of our days are spent in front of our phones or computers. But this trend, though exacerbated by the unusual conditions of 2020, did not start with some flying rodent far away. Digital platforms have carved up more and more of our time, and seem to direct more of our actions. As policy students, we were interested in one instance where this seems to be happening very notably: the way Twitter communicates, condenses, and shapes public discourse around salient policy issues (guess it’s not procrastination if you can call it research).

From the broad question of “how does Twitter shape public discourse in our country?”, we decided to explore the public discourse around feminism and gender issues revealed by a select group of Twitter accounts of activists, political leaders, writers, and all-around opinion shapers in Colombia, where the four of us come from. We wanted to flex our newly acquired data science and text-as-data analysis muscles in a descriptive exercise.

In what follows you will find, first, a brief section on how we chose the accounts and how we tried to tame and clean the data. Then, with the help of some bi-grams and topic modelling, we tried to identify what these accounts actually talk about. We then attempted to understand how they relate to or differ from one another, with some exercises on scaling and network analysis. Finally, through sentiment analysis, we explored how these tweeters feel about a couple of somewhat controversial topics.

A few caveats before we begin.

We are aware that the results we got are not always easily interpretable. For the most part, what we did was (for us) an interesting exploration of the methods we learned, which provides some insight into what this large set of tweets is about. Ultimately, this is an initial ‘distant reading’ of a massive amount of information that could inform further analysis and new questions on the twitter discussion on gender in the country.

The assumptions that we need to make to take a ‘text as data’ approach are strong and not always easily fulfilled, and we believe some of them might be violated in the data we could acquire for this exercise.

  • First, through this exercise we learned that, unlike text data from political manifestos or speeches from a parliamentary debate, these tweeters write about a very wide range of issues, and for diverse purposes: expressing opinions on a matter, inviting the audience to read something, inviting people to an event, making jokes, etc.

This led to great difficulty in subsetting the tweets that were actually expressing opinions on gender issues. We decided not to scrape tweets based on a hashtag (which could have narrowed down the latent ‘messages’ or ‘topics’) because we were interested in the “elite” conversation on Twitter, and opinion leaders rarely use these hashtags. (They probably do not need to use hashtags to join a conversation because they are at the center of it.)

We also decided not to create a dictionary (list of words) to filter out the tweets associated with gender issues or specific topics, because for this exercise we preferred an exploratory and descriptive approach, rather than assume we knew what these feminists were talking about and introduce bias according to how we think people talk.

  • Second, tweets are very short texts or ‘documents’, which means there is very little information and variation within each text. This is a limitation for the methods we used to classify them into topics or find relative differences between them purely based on the words they use, which is what treating text as data does.

We believe these limitations affected the interpretability of our results. We would suggest this is because the words we looked at are generated in a less stable context, with fewer conventions on which words to use to express a certain message or position than, say, political speeches in parliament.

Despite all this, we do think we can learn some interesting things about how Twitter is used by prominent feminist individuals and institutions in Colombia.

The sample

Choosing accounts and scraping the data

Our initial intuition was that certain Twitter accounts shape public discourse, and that gathering those would give us a balanced and relatively complete picture of what most twitter-talk was about. This is partly the idea behind the Cifras y Conceptos opinion leaders panel, which traces the opinion of various individuals on a wide range of topics. These opinion leaders, they say, “differ from public opinion in general, because they are the ones who guide the climate of opinion, have the capacity for foresight and influence political issues and issues on the national agenda”, and so tracing their points of view should be telling of more than their personal standing on a given topic.

So we dove into the twitterverse to see who came out to greet us. With a combination of research, personal experience and consultation with two prominent public figures in the Colombian feminist sphere ([@lasigualadas](https://twitter.com/lasigualadas?lang=en) and [@gloriasusanaesquivel](https://twitter.com/gsesquivel)), we came up with 69 individual and 39 institutional accounts that seemed key to include if we were interested in what was being said about feminism and gender issues. We tried to create a sample of accounts that were identified by these ‘experts’ as indeed belonging to one same conversational space on Twitter (trying to get a somewhat stable discursive context) and that were ideologically diverse, so we explicitly asked to include ‘opponents’ in terms of gender discussions but also political ideology. This is, of course, not a representative sample of Colombian society, and not even of the feminist movement in the country, since it was not chosen at random, and people self-select into writing on Twitter. But that is an issue with all Twitter analysis (see for example Barberá, 2015).

So, aware of the fact that this was not a complete, balanced, objective picture of the public discourse on Twitter on these issues, we decided to keep going with what we had. This was our thinking: our agonizing about how bad our selection was only clarified further what our data science professors have been telling us since Stats I: fancy analytical tools only get you so far. If you actually want to be able to say something about the world, you need to work on your theory. Really work on it. But we felt this was an exercise about the tools we had learned. The tools, not the theory. And for that (to try our hand at a limited sample) we had enough.

We built a small data frame with the real names, usernames and a couple of covariates for the individual accounts (occupation and institutions). We set up our API authorization and scraped their timelines using the twitteR package and the basic (free) Twitter API, which allows gathering the latest 3,200 tweets from each account. You can find the code for it (without our Twitter keys) here and here. This gave us an initial tweet count of ~231,000 for individual accounts and ~93,000 for institutions, which seemed like a decent amount of text to begin with. But what a beautiful mess we got.

What did our data look like?

In our initial exploration of the data, we looked at the average tweets per individual account and the oldest tweet per account, since we knew the less frequent Twitter users would have much older tweets, which we can see below in the plotted frequency of tweets across time (left plot). We thus limited our data to tweets from the last 6 months and plotted those (right plot). This produced a much more balanced sample, with 67 individual accounts and 116,402 tweets.
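That six-month cut is just a date filter. A minimal sketch of the logic with made-up accounts and dates (shown in Python for illustration; our actual pipeline ran in R):

```python
from datetime import datetime, timedelta

# Hypothetical (account, timestamp) pairs standing in for our scraped timelines.
tweets = [
    ("account_a", datetime(2020, 11, 3)),
    ("account_a", datetime(2019, 2, 14)),  # an old tweet from an infrequent user
    ("account_b", datetime(2020, 8, 21)),
]

# Keep only tweets from the six months before the (made-up) scraping date.
scrape_date = datetime(2020, 12, 1)
cutoff = scrape_date - timedelta(days=182)  # roughly six months
recent = [(acc, ts) for acc, ts in tweets if ts >= cutoff]
print(len(recent))  # 2: the 2019 tweet is dropped
```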





From this sample, we then removed 22 accounts from congresswomen. We decided to do this after having run a topic model with a random sample of 7,000 tweets (which was already a stretch for our 2012 laptops). Although we knew this had implications for what we would be able to say in our analysis later, these accounts had too much content pertaining to topics other than gender/feminism, and including them would have made it even harder to get a sense of what the public discourse around these issues actually is.

Finally, we restricted the institutional accounts to match the same period we had chosen for the individual ones, and ended up with 39 accounts and 25,941 tweets. Here, because the institutions we had chosen are explicitly dedicated to the topics we were interested in, there was no need to leave anyone out. Institutions are, well, more institutional…





Cleaning the data

With our data ready and the help of quanteda, we created a corpus. Finally, text was data. And so we did what any text miner would do: we got our rags and buckets out, put our aprons on, and began cleaning.
We removed stop words (both those that come in the tm package and some we compiled in our own list), punctuation, numbers, and symbols. Then we removed mentions: we were after the what-is-what, more than the who-is-who, of Colombian feminist Twitter. (And we would get to connections later on, with the network analysis.) Next were hashtags. Here, again, we understood that this would limit our analysis somewhat, but we felt we had a solid theory-based reason for it. So we had that going for us, which is nice. The reason is that hashtags tend to work globally, as a shortcut into the apparently borderless internet conversation, and we felt including them might disrupt the picture of the more local discourse we were trying to paint.
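Roughly, this cleaning boils down to a handful of regular-expression passes. A toy version (Python regexes for illustration; we did this in R with quanteda/tm, and the two-word stop list here is a stand-in for our real one):

```python
import re

def clean_tweet(text, extra_stopwords=frozenset({"rt", "via"})):
    """Strip URLs, mentions, hashtags, numbers and punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)     # URLs
    text = re.sub(r"@\w+", " ", text)             # mentions: the who-is-who
    text = re.sub(r"#\w+", " ", text)             # hashtags: the global shortcuts
    text = re.sub(r"[^a-záéíóúñü\s]", " ", text)  # punctuation, numbers, symbols
    tokens = [t for t in text.split() if t not in extra_stopwords]
    return " ".join(tokens)

print(clean_tweet("RT @amiga: Marchamos el #25N https://t.co/abc 100%"))
# → "marchamos el"
```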

And then, we did it all over again for our institutional accounts. By this time, this project was beginning to feel a little like what we figure raising twins must be like: you do a lot of cleaning. And you do it all twice. But we had gotten this far. And we were finally ready to see what all these tweets were about.

What are these tweets about?

What are the most common expressions in the network?

Before modeling, we tried to visualize the data we had. We wanted to observe the data’s structure and look for relations in the text. Following techniques found online, like the one employed by Orduz (2018), we performed a network analysis. This allows us to understand the tweet text graphically, as a weighted network.

As a first exploration, we looked at the pairwise relative occurrence of words. We did a bi-gram analysis for individual and institutional accounts. We created the bi-grams and cleaned them (removing stopwords, https links, emoticons, irrelevant word pairs, etc.). Afterwards, we defined a weighted network from the bi-gram counts and got our first graphs (for individual and institutional accounts):

  • Each word represents a node.
  • Two words are connected if they appear as a bi-gram.
  • The weight of an edge is the number of times the bi-gram appears in the corpus.

We also added some information to the visualization, setting the sizes of the nodes and the edges by their degree and weight respectively. We used the function strength to get the weighted degree.
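The three rules above, plus the weighted degree, can be sketched in a few lines. A toy version with invented, already-cleaned tweets (Python for illustration; in R we built the graph with igraph and used its strength function):

```python
from collections import Counter

docs = [
    "violencia sexual trabajo",  # toy tweets, already cleaned
    "violencia sexual acoso",
    "derechos humanos",
]

# Each word is a node; consecutive words form an edge; the weight is the bi-gram count.
edges = Counter()
for doc in docs:
    tokens = doc.split()
    for a, b in zip(tokens, tokens[1:]):
        edges[(a, b)] += 1

# Weighted degree ("strength"): the summed weight of a node's edges.
strength = Counter()
for (a, b), w in edges.items():
    strength[a] += w
    strength[b] += w

print(edges[("violencia", "sexual")])  # 2: the heaviest edge in this toy corpus
print(strength["sexual"])              # 4
```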

Moreover, we extracted the biggest connected component of the network to understand the most frequent conversation between gender public opinion leaders (individuals and institutions) on Twitter. We computed the clusters with a large threshold (100) and with a smaller threshold (50); the latter gives us a more complex network.

Finally, following Orduz (2018), we employed the Louvain method for community detection. This is an algorithm for detecting communities in networks: it evaluates how much more densely connected the nodes within a community are, compared to how connected they would be in a random network (neo4j, December 2020), and it recursively merges communities into single nodes, computing modularity on the condensed graphs. We performed the method to check precisely the density of our connected nodes (words).
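The quantity the Louvain method optimizes is modularity. A minimal hand computation on a toy graph of two triangles joined by one bridge edge (Python for illustration, using the standard per-community formulation Q = Σ_c [L_c/m − (d_c/2m)²]; this is not our actual R code):

```python
def modularity(edges, community):
    """Q = sum over communities c of [L_c / m - (d_c / 2m)^2], where L_c is the
    edge weight inside c, d_c the total degree in c, and m the total edge weight."""
    m = sum(w for _, _, w in edges)
    degree = {}
    for u, v, w in edges:
        degree[u] = degree.get(u, 0) + w
        degree[v] = degree.get(v, 0) + w
    intra = {}    # L_c: edge weight inside each community
    for u, v, w in edges:
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + w
    deg_sum = {}  # d_c: total degree per community
    for node, k in degree.items():
        deg_sum[community[node]] = deg_sum.get(community[node], 0) + k
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in deg_sum.items())

# Two triangles joined by a single bridge edge, split into their natural groups.
edges = [("a", "b", 1), ("b", "c", 1), ("c", "a", 1),
         ("d", "e", 1), ("e", "f", 1), ("f", "d", 1), ("c", "d", 1)]
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(edges, community), 3))  # 0.357
```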

Bi-gram of individual accounts

Among the individual accounts, we observe that the word pairs connected with the highest weight within the network are (among others):

  • sexual, violence, work, harassment
  • persons, women, trans
  • human, rights
  • elites, political, drug trafficking
  • democracy, if not, destroy

This means that gender conversation leaders on Twitter tweet mostly about sexual violence and harassment at work, trans women, human rights, etc. This will give us clues for our topic model results.

With a large threshold (100), the biggest connected component is not a surprise. However, when we decrease the threshold, we see a more complex network of words. It seems that most of the tweets of gender public opinion leaders center on progressive movements that advocate for the rights of women and, particularly, of trans women who have been killed.

Finally, the community detection results for individual accounts show that four groups were identified, and the modularity (a measure of the “quality” of a given partition of the nodes in a network, like a clustering) is 0.5 within the biggest connected component of the word network. This result doesn’t seem either good or bad (better if closer to 1). However, we believe a modularity of 0.5 is a decent indication of how densely connected the conversation (between words) is.

Bi-gram of institutional accounts

The following word pairs are the most frequent and relevant within institutional gender conversations:

  • sexual, violence, intrafamily, abuse
  • persons, trans, lgbt, lgbti
  • rights, humans, sexual
  • women, victims, indigenous
  • digital, channels
  • armed, conflict

The following graphs allow us to conclude that the conversation among institutions that aim at gender equality (each in their own way) in Colombia is mainly about women victims: specifically, rural, young and indigenous women. This seems reasonable, since structural inequalities affect indigenous and rural women the most. Furthermore, gender equality has been deeply discussed in peacebuilding conversations; women are one of the groups most affected by the armed conflict in Colombia.

The community detection results show that 2 groups were identified and the modularity is 0.22. Institutions have a smaller modularity than individuals: it seems that the individual conversation is more densely connected than institutional discourse on Twitter.

Conclusions

Conversations among activists, political leaders, writers, and all-around opinion shapers from Colombia, including private or public institutions and NGOs in defense of women’s rights, are centered on women’s rights. Who would have thought? What a surprise! Among individual accounts, it seems that workplace harassment and sexual violence are at the center of the conversation. Institutions, on the other hand, are more focused on exposing and advocating against the injustices towards indigenous and rural women.

What topics can we identify?

We took our dfm for the individual accounts and turned it into an stm corpus to run our topic model, identifying the 10 most prevalent topics in the tweets we had. And yes, we then did it again for the institutional accounts. We broadly identified what the 10 topics were for both the individual and the institutional accounts, although the model does not perfectly classify the documents according to our posterior interpretation. But all in all, what we got seemed reasonable.
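The dfm itself is just a word-count matrix: one row per tweet, one column per vocabulary term. A toy version with invented tweets (Python for illustration; in our pipeline this was quanteda’s dfm fed into stm):

```python
from collections import Counter

docs = ["mujeres derechos mujeres", "violencia estado", "mujeres violencia"]
vocab = sorted({w for d in docs for w in d.split()})

# Document-feature matrix: one row per tweet, one column per vocabulary word.
dfm = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)   # ['derechos', 'estado', 'mujeres', 'violencia']
print(dfm[0])  # [1, 0, 2, 0]
```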

Topics in institutional accounts

Our chosen institutions spend a lot of time tweeting about women in the public sphere (well, duh…). They also use twitter to talk about their institutional events and work, which was also to be expected. And then it gets more interesting. Institutional accounts talk almost as much about social policy, as they talk about violence. And both the armed conflict and the truth commission figure prominently.

The model also picked up the conversation about reproductive rights that was sparked by an attempt made in mid-November to re-criminalize the three grounds on which abortion is currently legal in Colombia. In line with the decision of the Constitutional Court (which upheld its 2006 verdict de-criminalizing abortion under certain circumstances), the accounts we chose talk about abortion in terms of rights and access.

Another cluster of discourse formed around the LGBT community and the pandemic. This is probably due to the escalation of police violence against the LGBT community during efforts to enforce the curfews put in place due to the pandemic. But in true institutional spirit, this topic includes more words about dialogue than about accountability.

Topics in individual accounts

Individual accounts center on women’s rights, which seems fairly obvious. It is, however, interesting that discourse here still seems focused on achieving equality with respect to men, which might be an initial indication of how far along the debate on gender is in Colombia.
As with the institutional accounts, violence features very prominently, but here it is mostly connected to the state.
Individual accounts also comment often on wider topics of national politics and public opinion, which was perhaps to be expected, but raised our concerns about how well our model could classify the documents.

Do topics differ by occupation?

We got curious about how different occupations might affect the prevalence of these topics in each account. And since we had that information, we went ahead and made more plots:

A couple of interesting outcomes:

  • Artists, activists and writers tweet the most about state violence.
  • Policy makers seem to have an opinion about everything, but mostly talk about journalism.
  • Economists seem very concerned about gender violence. Wait, what? Told you our model wasn’t perfect.

What is each topic about? Word-topic probabilities

Then, with the help of the LDA model, we calculated the probability of each word being generated from each topic (the betas) and the ‘per-document-per-topic probabilities’: the proportion of words from each document that are generated from each topic.
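In spirit, both quantities are normalized counts. A toy illustration with made-up numbers (Python; these are not our actual estimates):

```python
# Toy counts, as a fitted topic model might produce: topic -> word -> count.
topic_word_counts = {
    0: {"mujeres": 8, "derechos": 2},
    1: {"violencia": 6, "estado": 4},
}

# Beta: probability of a word given a topic (row-normalized counts).
beta = {
    t: {w: c / sum(counts.values()) for w, c in counts.items()}
    for t, counts in topic_word_counts.items()
}
print(beta[0]["mujeres"])  # 0.8

# Per-document topic proportions, from the topic assigned to each word token.
doc_assignments = [0, 0, 1, 0]  # topic of each word token in one tweet
doc_topic = {t: doc_assignments.count(t) / len(doc_assignments)
             for t in set(doc_assignments)}
print(doc_topic[0])  # 0.75
```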

We plotted the whole thing, inspired by Julia Silge’s blog, which is awesome and which you can check out here: https://juliasilge.com/blog/evaluating-stm/

Some things caught our eye:

  • Although most institutional events are in Bogotá, the talk about social policy seems to be all about the territory. Centralization strikes again…
  • Abortion is still top of the list in terms of reproductive rights for institutions, but is not so prominent in individual accounts.
  • National politics is talked about often in relation to op-eds.
  • Individual accounts talk about polarization, and often include words such as ‘dictatorship’, which might be indicative of that political climate.

Do these topics change in time?

After running the LDA model, we wanted to observe whether the topics change over time, so we plotted the proportion of documents from each topic by week, from July to December of 2020. We looked at the events that occurred during this period to get a better understanding of the trends the data presents.
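The weekly proportions are a simple group-by: bin each document by week, then divide each topic’s count by that week’s total. A sketch with a few hypothetical tweets and topic labels (Python for illustration):

```python
from collections import Counter
from datetime import date

# Hypothetical (date, dominant topic) pairs for a handful of tweets.
tweets = [
    (date(2020, 11, 23), "women's rights"),
    (date(2020, 11, 25), "women's rights"),  # Intl. Day for the Elimination of Violence
    (date(2020, 11, 26), "state violence"),
]

# Group documents by ISO week, then compute each topic's share of that week.
by_week = {}
for d, topic in tweets:
    by_week.setdefault(d.isocalendar()[1], Counter())[topic] += 1

week = by_week[tweets[0][0].isocalendar()[1]]
total = sum(week.values())
shares = {t: n / total for t, n in week.items()}
print(shares)  # women's rights dominate that week: 2 of 3 documents
```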

Individual accounts

Some things caught our eye:

  • The highest proportion of documents talk about women’s rights and state violence.
  • Women’s rights had several peaks during this period; our intuition is that this trend is due to the debates in Congress about abortion and gender parity on electoral lists from July to October, and to the International Day for the Elimination of Violence against Women in November.
  • State violence was a recurrent topic too; we believe it is related to the cases of police brutality that occurred in September in Bogotá and the National Strike held in November.
  • National politics has a peak in August; this may be related to the extradition of the ex-paramilitary Salvatore Mancuso to Colombia and the possibility of his participation in the Special Peace Jurisdiction, an event that created great polarization in public opinion.

Institutional accounts

Some things caught our eye:

  • The highest proportion of documents talk about women and institutional events.
  • Violence has a peak between September and October that coincides with the peak in the individual accounts regarding state violence.

How do they relate and differ from each other?

Now we move to positions! We want to know where each Twitter account lives in space and how they relate to each other. To do so, we scaled some accounts in one space based on the tweets discussing certain topics. The scaling allows us to identify how close each account is to the others depending on the vocabulary they use regarding a topic; in this case we selected political topics and reproductive health topics.

It is important to mention that inside our selected categories (political topics and reproductive health), the accounts include information on other topics, hindering the precision of the analysis. It was not possible to do a clean subset of the Twitter accounts that only discussed one topic. This is why we believe the topic model has some limitations with Twitter data, and more manual classification is needed to improve precision.

How would they be distributed if put in a single space?

Tweets on politics

Analyzing the scale based on tweets discussing political topics, we can observe that, from the sample of 25 Twitter accounts, only three position on the positive side, one at 0, and the rest on the negative side. These results imply that most of the selected Twitter accounts use similar vocabulary when discussing political topics, so they are close to each other on the scale.

Lacadavidc, an activist, distances herself from the group, positioning beyond -3; this could be due to her polemical positions against the trans women’s movement. In contrast, we observe alejaoficial, who positions at almost 1; she is a famous artist who advocates against violence towards women. The scale shows that these two women, even though they both talk about women’s rights, do so using different vocabulary.

Tweets on reproductive health

What do the extremes of the scale seem to represent?

The institutional accounts’ scale on reproductive rights topics shows an interesting grouping of accounts. The accounts close to 0 and on the positive side are institutions that advocate for women’s rights: Women_ Equity, Women Commission Colombia, Women Secretary and ONU Women; these accounts mainly use vocabulary related to women and their reproductive rights. In the middle, between -2 and 0, the spectrum opens up more, and we find activists’ Twitter accounts; these are close to each other, which means they use similar vocabulary and discuss similar topics, for instance regarding abortion and the LGBT community. The last group is the Constitutional Court and the Truth Commission; we believe they are at the extreme left because they use a different language compared to the other institutional or activist accounts, and they discuss a variety of topics regarding reproductive rights beyond women.

How are they connected on Twitter?

How do they feel?

After understanding how the accounts are positioned in space and how they relate to each other, we want to see how positive or negative the vocabulary they use is. To achieve this, we performed a sentiment analysis. There are different ways to do it, for instance scaling models, classification models, or dictionary models; we chose the last option!

We began by looking for a dictionary to analyze our data. It was quite a challenge, we must say, to find a good-quality dictionary in Spanish. We finally picked the Full-Strength Lexicon dictionary (Perez, V., et al., 2012).

Using quanteda, we applied this dictionary to our corpus, for both the personal and the institutional accounts. As the personal accounts contained too many documents, we grouped them by main occupation; we didn’t have this issue with the institutional accounts, which were analyzed individually. Exploring the results with the first dictionary, we didn’t see a significant difference between the positive and negative scores. So we decided to perform a qualitative analysis of the dictionary, and found that the most frequent words used by the accounts in each topic (information gathered in the topic model section) were not included in the dictionary we had selected. We thus added those words to the dictionary to improve the analysis, and the results after running the sentiment analysis again improved.
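At its core, the dictionary approach just counts lexicon hits per text. A minimal sketch (Python, with a made-up three-word lexicon standing in for the Spanish Full-Strength Lexicon we actually used):

```python
# Toy positive/negative lexicon (stand-ins for the real Spanish dictionary,
# which we augmented with frequent words from our topic model).
positive = {"derechos", "igualdad", "apoyo"}
negative = {"violencia", "acoso", "abuso"}

def sentiment_score(text):
    """Count dictionary hits and return (n_positive, n_negative, net score)."""
    tokens = text.lower().split()
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return pos, neg, pos - neg

print(sentiment_score("igualdad y derechos frente a la violencia"))  # (2, 1, 1)
```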

Let’s see what we got:

The sentiment scores for individual accounts show that activists, journalists, and writers are the occupations that use the highest number of positive and negative words in their vocabulary, in contrast with other occupations. Journalists have the highest number of words; this makes sense, as the main task of this occupation is to communicate and inform society about different topics, so their vocabulary uses a high number of words that can be classified on a positive-negative scale.

With the improved dictionary we captured a larger sample of words: the accounts use significantly more positive and negative words, and there is a greater difference in polarity between the positive and negative scores.

The positive words increased from 7,000 to around 30,000, and the negative from 7,000 to around 13,000, so we can infer that the sentiment towards the new words added to the dictionary was mainly positive. Moreover, the occupations with the highest scores continued to be activists, journalists, and writers, and the biggest change between the positive and negative scores is for journalists.

Regarding the institutions’ accounts, the trend resembles the personal accounts: in the first graph there is not much difference between the negative and positive scores. The accounts that use the highest number of positive and negative words are Legal Abortion Colombia, the Truth Commission, and the Women’s Secretary.

In contrast, with the improved dictionary the word sample increased from 1,200 to 5,000 words and the variation within scores changed. We identified new peaks of positive words in the Women Equity, Women Afro, Pacific Route, and Women’s Link accounts. Again, we believe the inclusion of the new words improved the quality of the sentiment analysis, and it shows us that both personal and institutional accounts tend to use positive language in their tweets.

How do they feel about the issue of trans women?

How do they feel about the issue of sexual misconduct?

How do they feel about the issue of abortion?

How do they feel about the issue of sex work?

Final remarks

References

https://juanitorduz.github.io/text-mining-networks-and-visualization-plebiscito-tweets/

https://neo4j.com/docs/graph-algorithms/current/algorithms/louvain/